SwePub

Result list for search "db:Swepub ;pers:(Jantsch Axel);pers:(Lu Zhonghai);mspu:(article)"


  • Result 1-10 of 30
1.
  • Anagnostopoulos, Iraklis, et al. (author)
  • Custom Microcoded Dynamic Memory Management for Distributed On-Chip Memory Organizations
  • 2011
  • In: IEEE Embedded Systems Letters. - 1943-0663. ; 3:2, pp. 66-69
  • Journal article (peer-reviewed)
    • Multiprocessor system-on-chip (MPSoCs) have attracted significant attention since they are recognized as a scalable paradigm to interconnect and organize a high number of cores. Current multicore embedded systems exhibit increased levels of dynamic behavior, leading to unexpected memory footprint variations unknown at design time. Dynamic memory management (DMM) is a promising solution for such types of dynamic systems. Although some efficient dynamic memory managers have been proposed for conventional bus-based MPSoC platforms, there are no DMM solutions regarding the constraints and the opportunities delivered by the physical distribution of multiple memory nodes of the platform. In this work, we address the problem of providing customized microcoded DMM on MPSoC platforms with distributed memory organization. Customization is enabled at application- and platform-level. Results show that customized microcoded DMM can serve approximately 7× more allocation requests compared to pure distributed memory platforms and perform 25% faster than the corresponding high-level implementation in C language.
2.
  • Chen, Xiaowen, et al. (author)
  • Cooperative communication based barrier synchronization in on-chip mesh architectures
  • 2011
  • In: IEICE Electronics Express. - : Institute of Electronics, Information and Communications Engineers (IEICE). - 1349-2543. ; 8:22, pp. 1856-1862
  • Journal article (peer-reviewed)
    • We propose cooperative communication as a means to enable efficient and scalable barrier synchronization on mesh-based many-core architectures. Our approach is different from but orthogonal to conventional algorithm-based optimizations. It relies on collaborating routers to provide efficient gather and multicast communication. In conjunction with a master-slave algorithm, it exploits the mesh regularity to achieve efficiency. The gather and multicast functions have been implemented in our router. Synthesis results suggest marginal area overhead. With synthetic and benchmark experiments, we show that our approach significantly reduces synchronization completion time and increases speedup.
3.
  • Chen, Xiaowen, et al. (author)
  • Cooperative communication for efficient and scalable all-to-all barrier synchronization on mesh-based many-core NoCs
  • 2014
  • In: IEICE Electronics Express. - : Institute of Electronics, Information and Communications Engineers (IEICE). - 1349-2543. ; 11:18, p. 20140542-
  • Journal article (peer-reviewed)
    • On many-core Networks-on-Chip (NoCs), communication is on the critical path of system performance and contended synchronization requests may cause a large performance penalty. Different from conventional algorithm-based approaches, the paper addresses the barrier synchronization problem from the angle of optimizing its communication performance and proposes cooperative communication as a means to achieve efficient and scalable all-to-all barrier synchronization on mesh-based many-core NoCs. With the cooperative communication, routers collaborate with one another to accomplish a fast barrier synchronization task. The cooperative communication is implemented in our router at low cost. Through comparative experiments, our approach evidently exhibits high efficiency and good scalability.
4.
  • Chen, Xiaowen, et al. (author)
  • Hybrid distributed shared memory space in multi-core processors
  • 2011
  • In: Journal of Software. - : International Academy Publishing (IAP). - 1796-217X. ; 6:12 SPEC. ISSUE, pp. 2369-2378
  • Journal article (peer-reviewed)
    • On multi-core processors, memories are preferably distributed and supporting Distributed Shared Memory (DSM) is essential for the sake of reusing huge amount of legacy code and easy programming. However, the DSM organization imports the inherent overhead of translating virtual memory addresses into physical memory addresses, resulting in negative performance. We observe that, in parallel applications, different data have different properties (private or shared). For the private data accesses, it's unnecessary to perform Virtual-to-Physical address translations. Even for the same datum, its property may be changeable in different phases of the program execution. Therefore, this paper focuses on decreasing the overhead of Virtual-to-Physical address translation and hence improving the system performance by introducing hybrid DSM organization and supporting run-time partitioning according to the data property. The hybrid DSM organization aims at supporting fast and physical memory accesses for private data and maintaining a global and single virtual memory space for shared data. Based on the data property of parallel applications, the run-time partitioning supports changing the hybrid DSM organization during the program execution. It ensures fast physical memory addressing on private data and conventional virtual memory addressing on shared data, improving the performance of the entire system by reducing virtual-to-physical address translation overhead as much as possible. We formulate the run-time partitioning of hybrid DSM organization in order to analyze its performance. A real DSM based multi-core platform is also constructed. The experimental results of real applications show that the hybrid DSM organization with run-time partitioning demonstrates performance advantage over the conventional DSM counterpart. The percentage of performance improvement depends on problem size, way of data partitioning and computation/communication ratio of parallel applications, network size of the system, etc. In our experiments, the maximal improvement is 34.42%, the minimal improvement 3.68%.
5.
  • Chen, Xiaowen, et al. (author)
  • Reducing Virtual-to-Physical address translation overhead in Distributed Shared Memory based multi-core Network-on-Chips according to data property
  • 2013
  • In: Computers & electrical engineering. - : Elsevier BV. - 0045-7906 .- 1879-0755. ; 39:2, pp. 596-612
  • Journal article (peer-reviewed)
    • In Network-on-Chip (NoC) based multi-core platforms, Distributed Shared Memory (DSM) preferably uses virtual addressing in order to hide the physical locations of the memories. However, this incurs a performance penalty due to the Virtual-to-Physical (V2P) address translation overhead for all memory accesses. Based on the data property which can be either private or shared, this paper proposes a hybrid DSM which partitions a local memory into a private and a shared part. The private part is accessed directly using physical addressing and the shared part using virtual addressing. In particular, the partitioning boundary can be configured statically at design time and dynamically at runtime. The dynamic configuration further removes the V2P address translation overhead for those data with changeable property when they become private at runtime. In the experiments with three applications (matrix multiplication, 2D FFT, and H.264/AVC encoding), compared with the conventional DSM, our techniques show performance improvement up to 37.89%.
6.
  • Eslami Kiasari, Abbas, et al. (author)
  • An Analytical Latency Model for Networks-on-Chip
  • 2013
  • In: IEEE Transactions on Very Large Scale Integration (VLSI) Systems. - 1063-8210 .- 1557-9999. ; 21:1, pp. 113-123
  • Journal article (peer-reviewed)
    • We propose an analytical model based on queueing theory for delay analysis in a wormhole-switched network-on-chip (NoC). The proposed model takes as input an application communication graph, a topology graph, a mapping vector, and a routing matrix, and estimates average packet latency and router blocking time. It works for arbitrary network topology with deterministic routing under arbitrary traffic patterns. This model can estimate per-flow average latency accurately and quickly, thus enabling fast design space exploration of various design parameters in NoC designs. Experimental results show that the proposed analytical model can predict the average packet latency more than four orders of magnitude faster than an accurate simulation, while the computation error is less than 10% in non-saturated networks for different system-on-chip platforms.
7.
  • Eslami Kiasari, Abbas, et al. (author)
  • Mathematical formalisms for performance evaluation of networks-on-chip
  • 2013
  • In: ACM Computing Surveys. - : Association for Computing Machinery (ACM). - 0360-0300 .- 1557-7341. ; 45:3, p. 38-
  • Journal article (peer-reviewed)
    • This article reviews four popular mathematical formalisms (queueing theory, network calculus, schedulability analysis, and dataflow analysis) and how they have been applied to the analysis of on-chip communication performance in Systems-on-Chip. The article discusses the basic concepts and results of each formalism and provides examples of how they have been used in Networks-on-Chip (NoCs) performance analysis. Also, the respective strengths and weaknesses of each technique and its suitability for a specific purpose are investigated. An open research issue is a unified analytical model for a comprehensive performance evaluation of NoCs. To this end, this article reviews the attempts that have been made to bridge these formalisms.
8.
  • Feng, Chaochao, et al. (author)
  • A 1-Cycle 1.25 GHz Bufferless Router for 3D Network-on-Chip
  • 2012
  • In: IEICE transactions on information and systems. - 0916-8532 .- 1745-1361. ; E95D:5, pp. 1519-1522
  • Journal article (peer-reviewed)
    • In this paper, we propose a 1-cycle high-performance 3D bufferless router with a 3-stage permutation network. The proposed router utilizes the 3-stage permutation network instead of the serialized switch allocator and 7 x 7 crossbar to achieve the frequency of 1.25 GHz in TSMC 65 nm technology. Compared with the other two 3D bufferless routers, the proposed router occupies less area and consumes less power. Simulation results under both synthetic and application workloads illustrate that the proposed router achieves lower average packet latency than the other two 3D bufferless routers.
9.
  • Feng, C., et al. (author)
  • Addressing transient and permanent faults in NoC with efficient fault-tolerant deflection router
  • 2013
  • In: IEEE Transactions on Very Large Scale Integration (VLSI) Systems. - 1063-8210 .- 1557-9999. ; 21:6, pp. 1053-1066
  • Journal article (peer-reviewed)
    • Continuing decrease in the feature size of integrated circuits leads to increases in susceptibility to transient and permanent faults. This paper proposes a fault-tolerant solution for a bufferless network-on-chip, including an on-line fault-diagnosis mechanism to detect both transient and permanent faults, a hybrid automatic repeat request and forward error correction link-level error control scheme to handle transient faults, and a reinforcement-learning-based fault-tolerant deflection routing (FTDR) algorithm to tolerate permanent faults without deadlock and livelock. A hierarchical-routing-table-based algorithm (FTDR-H) is also presented to reduce the area overhead of the FTDR router. Synthesized results show that, compared with the FTDR router, the FTDR-H router can reduce the area by 27% in an 8×8 network. Simulation results demonstrate that under synthetic workloads, in the presence of permanent link faults, the throughput of an 8×8 network with FTDR and FTDR-H algorithms is 14% and 23% higher on average than that with the fault-on-neighbor (FoN) aware deflection routing algorithm and the cost-based deflection routing algorithm, respectively. Under real application workloads, the FTDR-H algorithm achieves 20% fewer hop counts on average than the FoN algorithm. For transient faults, the performance of the FTDR router can achieve graceful degradation even at a high fault rate. We also implement the fault-tolerant deflection router which can achieve 400 MHz in TSMC 65-nm technology.
10.
  • Feng, Chaochao, et al. (author)
  • Support Efficient and Fault-Tolerant Multicast in Bufferless Network-on-Chip
  • 2012
  • In: IEICE transactions on information and systems. - 0916-8532 .- 1745-1361. ; E95D:4, pp. 1052-1061
  • Journal article (peer-reviewed)
    • In this paper, we propose three Deflection-Routing-based Multicast (DRM) schemes for a bufferless NoC. The DRM scheme without packet replication (DRM_noPR) sends multicast packets through a non-deterministic path. The DRM schemes with adaptive packet replication (DRM_PR_src and DRM_PR_all) replicate multicast packets at the source or intermediate node according to the destination position and the state of output ports to reduce the average multicast latency. We also provide fault-tolerance support in these schemes through a reinforcement-learning-based method to reconfigure the routing table to tolerate permanent faulty links in the network. Simulation results illustrate that the DRM_PR_all scheme achieves 41%, 43% and 37% less latency on average than that of the DRM_noPR scheme and 27%, 29% and 25% less latency on average than that of the DRM_PR_src scheme under three synthetic traffic patterns respectively. In addition, all three fault-tolerant DRM schemes achieve acceptable performance degradation at various link fault rates without any packet loss.